library(tidyverse)
library(mice)
library(car)
library(MASS)

HW 4 Missing Data & Model Selection: Data Analysis Problems
Advanced Regression (STAT 353-0)
Zihan Chu
February 28, 2025
All students are required to complete this problem set!
Data analysis problems
1. Exercise D20.1 (MI)
Using the United Nations social-indicators data (in UnitedNations.txt), develop a regression model for the response variable female expectation of life. Feel free to use whatever explanatory variables in the data set make sense to you and to employ variable transformations, if needed.
- Work initially with complete cases, and once you have an apparently satisfactory model, obtain estimates and standard errors of the regression coefficients.
united_data <- read.table("https://www.john-fox.ca/AppliedRegression/datasets/UnitedNations.txt")
united_data$region <- as.factor(united_data$region)
united_data_complete <- na.omit(united_data)
# Fit a linear regression model using complete cases
model_complete <- lm(lifeFemale ~ ., united_data_complete)
summary(model_complete)
Call:
lm(formula = lifeFemale ~ ., data = united_data_complete)
Residuals:
Min 1Q Median 3Q Max
-2.07954 -0.61680 0.03489 0.51446 1.60556
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.375e+01 6.740e+00 3.524 0.00182 **
regionAmerica 3.999e-01 1.129e+00 0.354 0.72643
regionAsia -1.335e-01 1.124e+00 -0.119 0.90643
regionEurope 2.248e+00 1.151e+00 1.954 0.06295 .
regionOceania -4.413e-01 1.926e+00 -0.229 0.82081
tfr -1.362e+00 5.801e-01 -2.347 0.02788 *
contraception -2.341e-02 2.044e-02 -1.146 0.26374
educationMale 5.993e-01 5.222e-01 1.148 0.26294
educationFemale -6.592e-01 4.926e-01 -1.338 0.19388
lifeMale 7.815e-01 8.484e-02 9.211 3.52e-09 ***
infantMortality -1.193e-02 2.853e-02 -0.418 0.67968
GDPperCapita 8.485e-05 5.089e-05 1.667 0.10903
economicActivityMale 4.226e-02 4.414e-02 0.957 0.34828
economicActivityFemale -1.883e-02 1.865e-02 -1.009 0.32341
illiteracyMale 1.001e-01 5.422e-02 1.846 0.07773 .
illiteracyFemale -1.227e-01 4.840e-02 -2.535 0.01850 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.247 on 23 degrees of freedom
Multiple R-squared: 0.9906, Adjusted R-squared: 0.9845
F-statistic: 161.5 on 15 and 23 DF, p-value: < 2.2e-16
Many predictors are not significant, suggesting possible multicollinearity. GDP per Capita has a very small coefficient (8.485e-05), suggesting it might benefit from log transformation. The residuals range from -2.08 to 1.61, which isn’t terrible but could potentially be improved.
vif(model_complete)
GVIF Df GVIF^(1/(2*Df))
region 33.032280 4 1.548344
tfr 18.312361 1 4.279295
contraception 4.202452 1 2.049988
educationMale 38.164269 1 6.177724
educationFemale 49.808193 1 7.057492
lifeMale 13.991749 1 3.740555
infantMortality 23.267489 1 4.823639
GDPperCapita 2.559988 1 1.599996
economicActivityMale 3.281195 1 1.811407
economicActivityFemale 2.468757 1 1.571228
illiteracyMale 18.958728 1 4.354162
illiteracyFemale 33.081818 1 5.751680
Data transformation 1
model_transformed <- lm(lifeFemale ~ tfr + lifeMale +
log(GDPperCapita) + economicActivityFemale + log(educationFemale) + economicActivityMale+ contraception,
data = united_data_complete)
summary(model_transformed)
Call:
lm(formula = lifeFemale ~ tfr + lifeMale + log(GDPperCapita) +
economicActivityFemale + log(educationFemale) + economicActivityMale +
contraception, data = united_data_complete)
Residuals:
Min 1Q Median 3Q Max
-2.6189 -0.7120 -0.1288 0.7447 2.7710
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.054690 4.790977 3.977 0.000389 ***
tfr -2.056746 0.381891 -5.386 7.11e-06 ***
lifeMale 0.847174 0.059300 14.286 3.49e-15 ***
log(GDPperCapita) 0.183585 0.254628 0.721 0.476316
economicActivityFemale 0.004541 0.018056 0.251 0.803100
log(educationFemale) 0.466637 0.969913 0.481 0.633816
economicActivityMale 0.016441 0.036739 0.448 0.657619
contraception -0.031832 0.019021 -1.674 0.104278
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.439 on 31 degrees of freedom
Multiple R-squared: 0.9831, Adjusted R-squared: 0.9793
F-statistic: 257.7 on 7 and 31 DF, p-value: < 2.2e-16
vif(model_transformed)
tfr lifeMale log(GDPperCapita)
5.955934 5.129213 2.158457
economicActivityFemale log(educationFemale) economicActivityMale
1.735643 2.627796 1.705966
contraception
2.731702
anova(model_complete, model_transformed)
Analysis of Variance Table
Model 1: lifeFemale ~ region + tfr + contraception + educationMale + educationFemale +
lifeMale + infantMortality + GDPperCapita + economicActivityMale +
economicActivityFemale + illiteracyMale + illiteracyFemale
Model 2: lifeFemale ~ tfr + lifeMale + log(GDPperCapita) + economicActivityFemale +
log(educationFemale) + economicActivityMale + contraception
Res.Df RSS Df Sum of Sq F Pr(>F)
1 23 35.743
2 31 64.200 -8 -28.456 2.2889 0.05752 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the influence diagnostics, Latvia, Indonesia, and Botswana appear as potential outliers. The residuals-vs-fitted plot shows a slight pattern, suggesting possible non-linearity, and the Q-Q plot shows some deviation from normality in the tails. Several predictors in the full model also have high GVIF values (educationFemale, illiteracyFemale, and tfr among them), pointing to multicollinearity, so I try to improve the model. The ANOVA comparison gives p = 0.0575, which is not significant at the 0.05 level: dropping the extra terms does not significantly worsen the fit, so the simpler Model 2 is preferred.
Data transformation 2
model_improved <- lm(lifeFemale ~ tfr + lifeMale +
log(GDPperCapita) + economicActivityFemale + log(educationFemale) ,
data = united_data_complete)
summary(model_improved)
Call:
lm(formula = lifeFemale ~ tfr + lifeMale + log(GDPperCapita) +
economicActivityFemale + log(educationFemale), data = united_data_complete)
Residuals:
Min 1Q Median 3Q Max
-2.7193 -0.7889 -0.2196 0.9811 3.2810
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.448869 4.839185 3.812 0.000571 ***
tfr -1.741571 0.300306 -5.799 1.74e-06 ***
lifeMale 0.854868 0.052140 16.396 < 2e-16 ***
log(GDPperCapita) 0.148270 0.257217 0.576 0.568228
economicActivityFemale 0.005607 0.016670 0.336 0.738727
log(educationFemale) 0.073993 0.954033 0.078 0.938648
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.46 on 33 degrees of freedom
Multiple R-squared: 0.9815, Adjusted R-squared: 0.9787
F-statistic: 350.1 on 5 and 33 DF, p-value: < 2.2e-16
vif(model_improved)
tfr lifeMale log(GDPperCapita)
3.579619 3.853990 2.140758
economicActivityFemale log(educationFemale)
1.437828 2.471110
anova(model_transformed, model_improved)
Analysis of Variance Table
Model 1: lifeFemale ~ tfr + lifeMale + log(GDPperCapita) + economicActivityFemale +
log(educationFemale) + economicActivityMale + contraception
Model 2: lifeFemale ~ tfr + lifeMale + log(GDPperCapita) + economicActivityFemale +
log(educationFemale)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 31 64.200
2 33 70.315 -2 -6.115 1.4764 0.2441
VIF values are improved (all < 4), indicating reduced multicollinearity. The ANOVA p-value of 0.2441 (> 0.05) means there is no significant difference between Model 1 and Model 2, so we can use the simpler Model 2: removing economicActivityMale and contraception does not significantly worsen the model.
For each one-unit increase in the fertility rate, female life expectancy decreases by 1.74 years, a highly significant effect (p < 0.001). For each one-year increase in male life expectancy, female life expectancy increases by 0.85 years, also highly significant (p < 0.001). GDP, female economic activity, and female education are not statistically significant. The model explains 98.15% of the variance in female life expectancy.
- Now redo your analysis in (a) but use multiple imputation.
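The iteration log that follows is the printed output of a mice() call whose code chunk is not shown. The same impute/fit/pool workflow can be sketched end to end on mice's built-in nhanes data (the dataset, formula, and seed here are illustrative, not the ones used for the UN data):

```r
library(mice)

# Impute 5 completed datasets (predictive mean matching by default),
# fit the analysis model in each, and pool with Rubin's rules
imp_demo <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)
fits <- with(imp_demo, lm(bmi ~ age + chl))
summary(pool(fits))
```

For the UN data, the same calls would target united_data and the lifeFemale model instead.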
iter imp variable
1 1 tfr contraception educationMale educationFemale lifeMale lifeFemale infantMortality GDPperCapita economicActivityMale economicActivityFemale illiteracyMale illiteracyFemale
... (the same variable list repeats for imputations 1-5 within each of iterations 1-5) ...
5 5 tfr contraception educationMale educationFemale lifeMale lifeFemale infantMortality GDPperCapita economicActivityMale economicActivityFemale illiteracyMale illiteracyFemale
model_imp_complete <- with(imp, lm(lifeFemale ~ region + tfr + contraception + educationMale +
educationFemale + lifeMale + infantMortality + GDPperCapita +
economicActivityMale + economicActivityFemale +
illiteracyMale + illiteracyFemale))
pooled_complete <- pool(model_imp_complete)
summary(pooled_complete)
term estimate std.error statistic df p.value
1 (Intercept) 1.996385e+01 3.181402e+00 6.27517500 42.725077 1.491394e-07
2 regionAmerica 4.426463e-01 4.961102e-01 0.89223372 49.993902 3.765422e-01
3 regionAsia 4.272759e-02 4.111295e-01 0.10392733 43.711301 9.177024e-01
4 regionEurope 1.013285e+00 4.970044e-01 2.03878472 56.151029 4.618996e-02
5 regionOceania -4.474448e-01 5.119470e-01 -0.87400597 157.035616 3.834496e-01
6 tfr -6.983754e-01 1.626416e-01 -4.29395199 26.611063 2.077617e-04
7 contraception -2.428545e-02 1.021230e-02 -2.37806011 20.585202 2.718266e-02
8 educationMale 1.436599e-01 3.560962e-01 0.40343007 5.053972 7.031345e-01
9 educationFemale -1.311389e-01 3.738908e-01 -0.35074127 4.782813 7.406993e-01
10 lifeMale 8.547754e-01 3.761988e-02 22.72137563 68.363943 1.411941e-33
11 infantMortality -3.497512e-02 1.436239e-02 -2.43518806 9.664250 3.592168e-02
12 GDPperCapita 2.260934e-07 1.709956e-05 0.01322218 63.418593 9.894920e-01
13 economicActivityMale -1.342413e-02 2.147323e-02 -0.62515644 21.072391 5.385801e-01
14 economicActivityFemale 9.664915e-03 9.370237e-03 1.03144829 29.767437 3.106387e-01
15 illiteracyMale 4.732288e-02 2.758298e-02 1.71565504 24.656082 9.876715e-02
16 illiteracyFemale -5.867731e-02 2.354499e-02 -2.49213623 22.486989 2.053217e-02
est lo 95 hi 95 fmi
R^2 0.9847297 0.979497 0.9886347 0.1570263
lifeMale, tfr, contraception, infantMortality, illiteracyFemale, and regionEurope are significant predictors (p < 0.05).
Try an improved model
model_imp_transformed <- with(imp,
lm(lifeFemale ~ region +
log(tfr) +
log(lifeMale) +
I(infantMortality^2) +
log(contraception) +
log(illiteracyFemale) + economicActivityFemale + log(GDPperCapita)))
pooled_transformed <- pool(model_imp_transformed)
summary(pooled_transformed)
term estimate std.error statistic df p.value
1 (Intercept) -1.569690e+02 7.569300e+00 -20.73758449 103.403561 1.255615e-38
2 regionAmerica 3.833838e-01 3.924632e-01 0.97686560 192.193737 3.298637e-01
3 regionAsia -6.171583e-01 3.596034e-01 -1.71621916 144.112170 8.827070e-02
4 regionEurope 1.922553e-01 4.830477e-01 0.39800485 89.122067 6.915784e-01
5 regionOceania -6.652831e-01 6.225888e-01 -1.06857536 20.653250 2.975921e-01
6 log(tfr) -2.716701e+00 4.517948e-01 -6.01313096 105.725811 2.638188e-08
7 log(lifeMale) 5.513827e+01 1.895414e+00 29.09036240 72.295930 1.205907e-41
8 I(infantMortality^2) 3.599214e-06 4.682285e-05 0.07686874 76.135919 9.389296e-01
9 log(contraception) -4.761709e-01 2.158387e-01 -2.20614191 124.006342 2.921619e-02
10 log(illiteracyFemale) -5.608013e-01 1.209680e-01 -4.63594873 13.878828 3.939025e-04
11 economicActivityFemale 1.269129e-02 6.904894e-03 1.83801315 94.814683 6.919171e-02
12 log(GDPperCapita) 3.612626e-01 1.401303e-01 2.57804700 9.204761 2.927951e-02
est lo 95 hi 95 fmi
R^2 0.9851895 0.97991 0.9890892 0.2196999
test statistic df1 df2 dfcom p.value riv
1 ~~ 2 228.9663 10 70.28529 191 1.744554e-49 1.11773
I transformed several of the predictors. From the summary, the overall pooled R^2 increases slightly (0.9852 vs. 0.9847), and the pooled model-comparison test above has p-value 1.7e-49, indicating a significant difference between the original pooled model and the transformed one.
- Compare these results to those from the complete-case analysis. What do you conclude?
The multiple-imputation analysis uses all of the countries rather than only the 39 complete cases, so the pooled coefficients have noticeably smaller standard errors (for example, lifeMale: 0.038 pooled vs. 0.085 complete-case), and predictors such as contraception, infantMortality, and illiteracyFemale reach significance. The substantive conclusions are similar, with male life expectancy and fertility dominating, but the complete-case estimates are less precise and potentially biased unless the data are missing completely at random, so the MI results are preferable.
2. Exercise D20.3 (Selection)
Long (1997) reports a regression in which the response variable is the prestige of the academic departments where PhDs in biochemistry find their first jobs. The data are in the file Long-PhDs.txt.
Prestige is measured on a scale that runs from 1.00 to 5.00, and is unavailable for departments without graduate programs and for departments with ratings below 1.00. The explanatory variables include a dummy regressor for gender; the prestige of the department in which the individual obtained his or her PhD; the number of citations received by the individual's mentor; a dummy regressor coding whether or not the individual held a fellowship; the number of articles published by the individual; and the number of citations received by the individual.
Estimate the regression of prestige of first job on the other variables in three ways:
- code all of the missing values as 1.00 and perform an OLS regression;
phd_data <- read.table("https://www.john-fox.ca/AppliedRegression/datasets/Long-PhDs.txt")
phd_data$gender <- factor(phd_data$gender)
phd_data$fellowship <- factor(phd_data$fellowship)
missing_count <- colSums(is.na(phd_data))
print("Missing values per column:")
[1] "Missing values per column:"
missing_count
job gender phd mentor fellowship articles citations
99 0 0 0 0 0 0
phd_data_a <- phd_data
phd_data_a$job[is.na(phd_data_a$job)] <- 1.00
model_a <- lm(job ~ gender + phd + mentor + fellowship + articles + citations,
data = phd_data_a)
summary(model_a)
Call:
lm(formula = job ~ gender + phd + mentor + fellowship + articles +
citations, data = phd_data_a)
Residuals:
Min 1Q Median 3Q Max
-1.80152 -0.70105 -0.03821 0.68599 2.17601
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.9279903 0.1641470 5.653 2.99e-08 ***
gendermale 0.1391939 0.0902344 1.543 0.1237
phd 0.2726826 0.0493183 5.529 5.81e-08 ***
mentor 0.0011867 0.0007012 1.692 0.0913 .
fellowshipyes 0.2341384 0.0948206 2.469 0.0140 *
articles 0.0228011 0.0288843 0.789 0.4303
citations 0.0044788 0.0019687 2.275 0.0234 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8717 on 401 degrees of freedom
Multiple R-squared: 0.2101, Adjusted R-squared: 0.1983
F-statistic: 17.78 on 6 and 401 DF, p-value: < 2.2e-16
The overall model is statistically significant (F-statistic: 17.78, p-value < 2.2e-16) and explains about 21% of the variance in job prestige. There is a significant positive relationship between the prestige of the department where individuals obtained their PhD and their first job prestige. For each one-point increase in PhD department prestige, first job prestige increases by 0.273 points on average, holding other factors constant.
Fellowship status is significant at the 0.05 level (p = 0.014). Individuals with fellowships tend to secure first jobs in departments with prestige ratings about 0.23 points higher than those without fellowships.
The number of citations received by the individual has a small but significant positive effect on job prestige. Each additional citation is associated with a 0.004 point increase in first job prestige.
The citations received by an individual’s mentor show a marginally significant positive effect, suggesting some benefit from having a well-cited mentor.
Gender: Being male is associated with a 0.139 point increase in job prestige compared to being female, but this effect is not statistically significant (p = 0.124).
The number of articles published shows a positive but non-significant relationship with job prestige (p = 0.430).
- treat the missing values as truncated at 1.00 and employ Heckman's selection-regression model;
library(sampleSelection)
phd_data_b <- phd_data
phd_data_b$observed <- ifelse(!is.na(phd_data_b$job), 1, 0)
phd_data_b$job_trunc <- phd_data_b$job
phd_data_b$job_trunc[is.na(phd_data_b$job_trunc)] <- 1.00
heckman_model <- selection(
selection = observed ~ gender + phd + mentor + fellowship + articles + citations,
outcome = job_trunc ~ gender + phd + mentor + fellowship + articles + citations,
data = phd_data_b,
method = "2step"
)
summary(heckman_model)
--------------------------------------------
Tobit 2 model (sample selection model)
2-step Heckman / heckit estimation
408 observations (99 censored and 309 observed)
17 free parameters (df = 392)
Probit selection equation:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.413907 0.262790 -1.575 0.11605
gendermale 0.453428 0.147240 3.080 0.00222 **
phd 0.128267 0.080552 1.592 0.11211
mentor 0.001035 0.001332 0.777 0.43786
fellowshipyes 0.259776 0.155315 1.673 0.09521 .
articles 0.052669 0.061404 0.858 0.39156
citations 0.010287 0.005373 1.914 0.05629 .
Outcome equation:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.3932373 0.6480052 2.150 0.0322 *
gendermale -0.0758877 0.1574188 -0.482 0.6300
phd 0.3050634 0.0611690 4.987 9.22e-07 ***
mentor 0.0008408 0.0006884 1.221 0.2227
fellowshipyes 0.1603983 0.1361823 1.178 0.2396
articles 0.0084220 0.0279506 0.301 0.7633
citations 0.0023844 0.0021991 1.084 0.2789
Multiple R-Squared:0.2014, Adjusted R-Squared:0.1829
Error terms:
Estimate Std. Error t value Pr(>|t|)
invMillsRatio 0.1332 0.6905 0.193 0.847
sigma 0.7001 NA NA NA
rho 0.1903 NA NA NA
--------------------------------------------
For selection equation: Gender: Being male significantly increases the probability of having an observed job prestige value (p < 0.01). This suggests that male PhDs are more likely to find jobs in departments where prestige ratings are available. Having a fellowship shows a marginally significant positive effect on having an observed prestige value, indicating that fellowship holders may be more likely to secure positions in departments with available prestige ratings. Individual citations have a marginally significant positive effect (β = 0.010, p < 0.10) on whether job prestige is observed, suggesting that more cited individuals are somewhat more likely to have jobs with measurable prestige. PhD department prestige, mentor citations, and number of articles do not significantly predict whether job prestige is observed.
For the outcome equation: only phd shows a significant effect on first-job prestige; the other predictors do not. The inverse Mills ratio is not statistically significant (p = 0.847), suggesting that selection bias may not be a major concern in this model. The low rho value (0.1903) likewise indicates a weak correlation between the error terms of the selection and outcome equations.
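For reference, the inverse Mills ratio that the two-step procedure appends to the outcome equation is lambda(z) = phi(z)/Phi(z), the standard normal density over its CDF, evaluated at the fitted probit index; a minimal sketch:

```r
# Inverse Mills ratio: dnorm / pnorm; large for observations unlikely to be selected
inv_mills <- function(z) dnorm(z) / pnorm(z)

inv_mills(0)            # dnorm(0) / 0.5, about 0.798
inv_mills(c(-2, 0, 2))  # decreasing in z
```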
- treat the missing values as censored and fit the Tobit model.
library(AER)
phd_data_c <- phd_data
phd_data_c$censored <- is.na(phd_data_c$job)
phd_data_c$job_censored <- phd_data_c$job
phd_data_c$job_censored[is.na(phd_data_c$job_censored)] <- 1.00
tobit_model <- tobit(job_censored ~ gender + phd + mentor + fellowship + articles + citations,
left = 1.00,
data = phd_data_c)
summary(tobit_model)
Call:
tobit(formula = job_censored ~ gender + phd + mentor + fellowship +
articles + citations, left = 1, data = phd_data_c)
Observations:
Total Left-censored Uncensored Right-censored
408 99 309 0
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.4485575 0.2178915 2.059 0.0395 *
gendermale 0.2368486 0.1165852 2.032 0.0422 *
phd 0.3225846 0.0639229 5.046 4.5e-07 ***
mentor 0.0013436 0.0008876 1.514 0.1301
fellowshipyes 0.3252657 0.1224575 2.656 0.0079 **
articles 0.0339053 0.0365017 0.929 0.3530
citations 0.0050900 0.0024752 2.056 0.0397 *
Log(scale) 0.0836396 0.0428043 1.954 0.0507 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Scale: 1.087
Gaussian distribution
Number of Newton-Raphson Iterations: 4
Log-likelihood: -560.3 on 8 Df
Wald-statistic: 96.21 on 6 Df, p-value: < 2.22e-16
The model is highly significant (Wald-statistic: 96.21, p-value < 2.22e-16). The prestige of the department where individuals obtained their PhD is the strongest predictor of first job prestige. For each one-point increase in PhD department prestige, the expected job prestige increases by about 0.32 points, holding other factors constant.
Having held a fellowship significantly increases expected job prestige. Individuals with fellowships tend to secure first jobs in departments with prestige ratings about 0.33 points higher than those without fellowships.
Being male is associated with higher job prestige (β = 0.237, p < 0.05). Male PhDs tend to secure first jobs in departments with prestige ratings about 0.24 points higher than female PhDs, after controlling for other factors.
The number of citations received by the individual has a small but significant positive effect. Each additional citation is associated with a 0.005 point increase in expected job prestige.
Neither mentor citations nor the number of articles published show statistically significant effects on job prestige, though both have positive coefficients.
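AER's tobit() is a convenience wrapper around survival::survreg(), so the same left-censored Gaussian fit can be reproduced with the survival package alone; a sketch on simulated data (all names and coefficient values here are illustrative, not from the PhD data):

```r
library(survival)

set.seed(1)
x <- rnorm(200)
y_star <- 1.5 + 0.5 * x + rnorm(200)  # latent outcome
y <- pmax(y_star, 1)                  # observed outcome, floored at 1 like the prestige scale

# event = TRUE for uncensored observations, FALSE for left-censored ones
fit <- survreg(Surv(y, y > 1, type = "left") ~ x, dist = "gaussian")
coef(fit)
```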
- Compare the estimates and coefficient standard errors obtained by the three approaches. Which of these approaches makes the most substantive sense?
ols_results <- summary(model_a)$coefficients
heckman_results <- summary(heckman_model)$estimate
tobit_results <- summary(tobit_model)$coefficients
common_vars <- Reduce(intersect, list(rownames(ols_results), rownames(heckman_results), rownames(tobit_results)))
ols_subset <- ols_results[common_vars, , drop=FALSE]
heckman_subset <- heckman_results[common_vars, , drop=FALSE]
tobit_subset <- tobit_results[common_vars, , drop=FALSE]
comparison_df <- data.frame(
Variable = common_vars,
OLS_Coef = ols_subset[, 1],
OLS_StdErr = ols_subset[, 2],
Heckman_Coef = heckman_subset[, 1],
Heckman_StdErr = heckman_subset[, 2],
Tobit_Coef = tobit_subset[, 1],
Tobit_StdErr = tobit_subset[, 2]
)
print(comparison_df) Variable OLS_Coef OLS_StdErr Heckman_Coef
(Intercept) (Intercept) 0.927990303 0.1641469701 -0.413907492
gendermale gendermale 0.139193935 0.0902344238 0.453428161
phd phd 0.272682644 0.0493183451 0.128267082
mentor mentor 0.001186708 0.0007011642 0.001034672
fellowshipyes fellowshipyes 0.234138434 0.0948206487 0.259775621
articles articles 0.022801124 0.0288842707 0.052668574
citations citations 0.004478843 0.0019686646 0.010287045
Heckman_StdErr Tobit_Coef Tobit_StdErr
(Intercept) 0.262790471 0.448557544 0.2178915284
gendermale 0.147240499 0.236848569 0.1165851688
phd 0.080551984 0.322584618 0.0639228959
mentor 0.001332314 0.001343628 0.0008875672
fellowshipyes 0.155315145 0.325265728 0.1224574981
articles 0.061404494 0.033905332 0.0365017207
citations 0.005373294 0.005090006 0.0024751689
PhD is consistently significant across all three approaches. This suggests that PhD department prestige is a robust predictor of first job prestige regardless of how missing data is handled. Gender Effects show striking inconsistency. In the Heckman model, gender has a large positive coefficient (0.453) in the selection equation but non-significant in the OLS approach (0.139). The Tobit model shows a stronger effect (0.237) than OLS. This suggests that gender may influence both the selection process and outcome differently. Fellowship Status has consistent positive effects in OLS (0.234) and Tobit (0.325) but shows up primarily in the selection equation of the Heckman model (0.260), suggesting it affects whether one gets a job with measurable prestige rather than the prestige level itself.
The Tobit model makes the most substantive sense for several reasons: 1. The Heckman model reveals that gender and citations affect whether job prestige is observed, indicating a selection process that Tobit can handle more parsimoniously. 2. The Tobit model produces estimates that generally align with theoretical expectations about academic job markets, where PhD prestige, fellowship status, gender, and citations all matter. 3. Tobit generally has more reasonable standard errors than Heckman while accounting for the censoring mechanism (unlike OLS).
3. Exercise (Bootstrap)
We will now consider the Boston housing dataset from the MASS package.
Run ??Boston in the console to see the codebook.
- Provide an estimate of the population mean of
medv. Call this estimate \(\hat{\mu}\).
- What is the formula for the standard error of an estimate of the mean? Use this to provide an estimate of the standard error of \(\hat{\mu}\) in (a).
- Estimate this standard error using the bootstrap. How does this compare to the answer from (b)?
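The code for parts (a) and (b) is not shown above, but they presumably used the sample mean and the textbook formula SE(mu-hat) = s / sqrt(n); a sketch that also defines the SE_mu_hat printed later:

```r
library(MASS)  # Boston data

mu_hat <- mean(Boston$medv)             # (a) estimate of the population mean of medv
n <- length(Boston$medv)
SE_mu_hat <- sd(Boston$medv) / sqrt(n)  # (b) SE(mu-hat) = s / sqrt(n)
c(mu_hat = mu_hat, SE = SE_mu_hat)
```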
library(boot)
# Statistic function for boot(): mean of the observations selected by each resample
mean_func <- function(data, indices) {
return(mean(data[indices]))
}
set.seed(123)
bootstrap_results <- boot(data = Boston$medv,
statistic = mean_func,
R = 10000)
print(bootstrap_results)
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = Boston$medv, statistic = mean_func, R = 10000)
Bootstrap Statistics :
original bias std. error
t1* 22.53281 -0.002135435 0.4088195
# The bootstrap standard error
bootstrap_se <- sd(bootstrap_results$t)
cat("Classical estimate of standard error of mean:", SE_mu_hat, "\n")
Classical estimate of standard error of mean: 0.4088611
cat("Bootstrap estimate of standard error of mean:", bootstrap_se, "\n")
Bootstrap estimate of standard error of mean: 0.4088195
The bootstrap estimate (0.40882) is nearly identical to the classical estimate (0.40886). The tiny difference reflects Monte Carlo variability in the resampling and the fact that the bootstrap implicitly divides by n rather than n - 1 in the variance. The bootstrap does not reduce the true standard error; it simply estimates the same quantity without relying on a formula.
- Provide an estimate of \(\hat{\mu}_{med}\), the median value of
medvin the population.
- Estimate the standard error of \(\hat{\mu}_{med}\). Notice that there is no simple formula to do this, so instead use the bootstrap. Comment on your findings.
median_func <- function(data, indices) {
return(median(data[indices]))
}
set.seed(123)
bootstrap_results <- boot(data = Boston$medv,
statistic = median_func,
R = 10000)
median_estimate <- bootstrap_results$t0
median_se <- sd(bootstrap_results$t)
cat("Bootstrap estimate of standard error of median:", median_se, "\n")
Bootstrap estimate of standard error of median: 0.3773774
The histogram of medv shows a right-skewed distribution with high-value outliers. The bootstrap standard error of the median (0.377) is smaller than that of the mean (0.409), suggesting the median is slightly more stable here. If the distribution were perfectly symmetric we would expect the two standard errors to be closer; the mean is more sensitive to the outliers, which contributes to its slightly higher variability.
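Since the bootstrap already produces the full resampling distribution of the median, a percentile confidence interval comes essentially for free; a sketch (R = 2000 and the seed are arbitrary choices):

```r
library(boot)
library(MASS)

median_func <- function(data, indices) median(data[indices])

set.seed(123)
b <- boot(Boston$medv, statistic = median_func, R = 2000)
ci <- boot.ci(b, type = "perc")  # 95% percentile interval for the median of medv
ci
```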
4. Exercise D22.1 (Model Selection)
The data file BaseballPitchers.txt contains salary and performance data for major-league baseball pitchers at the start of the 1987 season. The data are analogous to those for baseball hitters used as an example in the chapter. Be sure to explore the data and think about variables to use as predictors before specifying candidate models.
- Employing one or more of the methods of model selection described in the text, develop a regression model to predict pitchers’ salaries.
baseball_data <- read.table("https://www.john-fox.ca/AppliedRegression/datasets/BaseballPitchers.txt", header=TRUE)
baseball_clean <- baseball_data %>% filter(!is.na(salary))
baseball_clean$league86 <- as.factor(baseball_clean$league86)
baseball_clean$league87 <- as.factor(baseball_clean$league87)
baseball_clean$team86 <- as.factor(baseball_clean$team86)
baseball_clean$team87 <- as.factor(baseball_clean$team87)
# null_model and full_model used below were presumably defined along these lines:
# null_model <- lm(salary ~ 1, data = baseball_clean)
# full_model <- lm(salary ~ ., data = baseball_clean)  # exact predictor set not shown
Use AIC and BIC
# Backward selection using AIC
backward_aic <- step(full_model, direction = "backward", trace = FALSE)
summary(backward_aic)$adj.r.squared
[1] 0.4880959
AIC(backward_aic)
[1] 2494.15
# Backward selection using BIC
backward_bic <- stepAIC(full_model, direction="backward", trace = FALSE, k=log(nrow(baseball_clean)))
summary(backward_bic)$adj.r.squared
[1] 0.3810568
BIC(backward_bic)
[1] 2520.265
# Stepwise selection using AIC
stepwise_aic <- step(null_model,
scope = list(lower = formula(null_model),
upper = formula(full_model)),
direction = "both",trace = FALSE)
summary(stepwise_aic)$adj.r.squared
[1] 0.4879433
AIC(stepwise_aic)
[1] 2493.403
# Stepwise selection using BIC
stepwise_bic <- stepAIC(full_model, direction="both", trace = FALSE, k=log(nrow(baseball_clean)))
summary(stepwise_bic)$adj.r.squared
[1] 0.397887
BIC(stepwise_bic)
[1] 2519.557
AIC-selected models explain more variance (higher adjusted R^2) but are more complex; BIC-selected models are more parsimonious but explain less variance. For both criteria, stepwise selection performed slightly better than backward selection. The differences are small, but they suggest that some variables excluded early in backward selection can be valuable in combination with others.
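The AIC/BIC distinction in these calls is just the penalty multiplier k in step(): k = 2 gives AIC, k = log(n) gives BIC. A self-contained illustration on the built-in mtcars data (the homework applies the identical pattern to baseball_clean):

```r
full <- lm(mpg ~ ., data = mtcars)
null <- lm(mpg ~ 1, data = mtcars)

# k = 2 (the default) penalizes by AIC; k = log(n) penalizes by BIC
aic_fit <- step(full, direction = "backward", trace = FALSE)
bic_fit <- step(full, direction = "backward", trace = FALSE, k = log(nrow(mtcars)))
both_fit <- step(null, scope = list(lower = formula(null), upper = formula(full)),
                 direction = "both", trace = FALSE)

c(AIC = AIC(aic_fit), BIC = BIC(bic_fit))
```

Because step() only accepts moves that improve the chosen criterion, the selected models can never score worse than the starting model on that criterion.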
Final model
Call:
lm(formula = salary ~ years + careerERA + IP86 + team87 + careerSV +
league87, data = baseball_clean)
Residuals:
Min 1Q Median 3Q Max
-609.2 -158.1 -26.6 125.1 871.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 57.6001 226.9173 0.254 0.799976
years 34.2776 5.5267 6.202 5.36e-09 ***
careerERA -113.6827 41.9487 -2.710 0.007528 **
IP86 1.9527 0.4111 4.750 4.78e-06 ***
team87Bal. 443.1004 172.2881 2.572 0.011107 *
team87Bos. 366.8161 172.5102 2.126 0.035145 *
team87Cal. 253.6172 194.5625 1.304 0.194433
team87Chi. 489.7911 129.4280 3.784 0.000224 ***
team87Cin. 160.9367 144.1991 1.116 0.266212
team87Cle. -44.2065 178.0179 -0.248 0.804229
team87Det. 548.6312 173.3258 3.165 0.001883 **
team87Hou. 34.7395 134.1054 0.259 0.795962
team87K.C. 385.5194 174.7426 2.206 0.028921 *
team87L.A. 281.2451 133.5009 2.107 0.036843 *
team87Mil. 217.7212 186.0869 1.170 0.243895
team87Min. 380.3508 173.7509 2.189 0.030170 *
team87Mon. -148.4768 144.1560 -1.030 0.304714
team87N.Y. 255.4010 133.5029 1.913 0.057683 .
team87Oak. 292.4248 163.7470 1.786 0.076188 .
team87Phi. 43.6651 138.4021 0.315 0.752834
team87Pit. 57.3938 139.5902 0.411 0.681554
team87S.D. 147.4488 138.6138 1.064 0.289192
team87S.F. 6.6883 144.6113 0.046 0.963174
team87Sea. 199.1914 187.0554 1.065 0.288677
team87St.L. 82.6906 145.6265 0.568 0.571019
team87Tex. 197.2861 169.4256 1.164 0.246132
team87Tor. 310.1011 174.6704 1.775 0.077909 .
careerSV 1.3671 0.6250 2.187 0.030311 *
league87N 207.4445 105.3955 1.968 0.050921 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 266.2 on 147 degrees of freedom
Multiple R-squared: 0.5699, Adjusted R-squared: 0.4879
F-statistic: 6.956 on 28 and 147 DF, p-value: 8.51e-16
- How successful is the model in predicting salaries? Does the model make substantive sense?
Check the final model's VIF values
GVIF Df GVIF^(1/(2*Df))
years 1.567258 1 1.251902
careerERA 1.668797 1 1.291819
IP86 1.655416 1 1.286630
team87 12.336876 23 1.056141
careerSV 1.914102 1 1.383511
league87 6.866400 1 2.620382
From here, we can see that on the comparable GVIF^(1/(2*Df)) scale all six predictors are low; only league87 (GVIF 6.87) is moderately elevated. Nothing here suggests problematic multicollinearity.
Use cross-validation
library(caret)
cv_results <- train(salary ~ years + careerERA + IP86 + team87 + careerSV + league87,
data = baseball_clean,
method = "lm",
trControl = trainControl(method = "cv", number = 10))
print(cv_results)
Linear Regression
176 samples
6 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 157, 160, 157, 160, 159, 158, ...
Resampling results:
RMSE Rsquared MAE
287.4584 0.4387928 221.0161
Tuning parameter 'intercept' was held constant at a value of TRUE
The model explains about 44% of the variance in pitcher salaries when applied to new data, lower than the adjusted R-squared of 0.4879 from the full fit. MAE is lower than RMSE, indicating that a few large errors inflate the RMSE.
The cross-validation R-squared (0.4388) is about 10% lower than the model's adjusted R-squared (0.4879). This indicates some overfitting, but not severe.
The cross-validation RMSE (287.5) is higher than the residual standard error from the fitted model (266.2), a roughly 8% increase in prediction error on new data.
Practically, an average error of roughly 287 (about $287,000 if salary is recorded in thousands of dollars) must be judged against the typical salary range; if average salaries are around $600,000 to $800,000, this represents a substantial percentage error.